We will embark on a study of wine data.
The data was obtained from https://www.google.com/url?q=https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityWhites.csv&sa=D&usg=AFQjCNHSo6vCJWIjCOZw6Kyy-C79XNFQUg
Let’s take a look at the variables.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
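The listing above has the shape of `names(wine)` output, although the loading step itself is not shown. Below is a minimal sketch of that step, assuming the CSV has been saved locally. The two-row sample, the temporary file, and the name `wine_sample` are stand-ins so the snippet runs on its own. If the file's first header field is blank (an assumption about the file), `read.csv()` names that column `X`, which would explain the `X` column.

```r
# Hypothetical two-row sample mimicking the layout of wineQualityWhites.csv.
# Only a few columns are included, and the blank first header is an assumption.
sample_csv <- c(
  '"","fixed.acidity","pH","alcohol","quality"',
  '"1",7,3,8.8,6',
  '"2",6.3,3.3,9.5,6'
)
path <- tempfile(fileext = ".csv")
writeLines(sample_csv, path)

# read.csv() turns the blank header into the syntactic name "X"
wine_sample <- read.csv(path)
names(wine_sample)
```

With the real file, the same call would be `wine <- read.csv("wineQualityWhites.csv")`, followed by `names(wine)`.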
How about their type?
str(wine)
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
We see that quality is stored as an integer. This is a problem because quality is really a discrete rating, and we’ll address this below.
summary(wine)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
We can see the distribution of the quality of wine with this histogram.
hist(as.numeric(wine$quality))
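Because quality takes only a handful of integer values, a frequency table can complement the histogram. The sketch below rebuilds the quality column from the per-quality counts reported in the grouped summary later in this report; the names `counts` and `quality` are introduced only for illustration.

```r
# Per-quality counts as reported in the grouped summary of this data set
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

# Rebuild the quality column from the counts (ratings 3 through 9)
quality <- rep(as.integer(names(counts)), times = counts)

table(quality)                          # frequency of each rating
round(prop.table(table(quality)), 3)    # share of each rating
barplot(table(quality), xlab = "quality", ylab = "count")
```

On the real data frame, `table(wine$quality)` gives the same counts directly.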
Let’s look at some features. What might influence the quality of wine?
aggregate(sulphates ~ quality, wine, mean)
## quality sulphates
## 1 3 0.4745000
## 2 4 0.4761350
## 3 5 0.4822032
## 4 6 0.4911056
## 5 7 0.5031023
## 6 8 0.4862286
## 7 9 0.4660000
aggregate(alcohol ~ quality, wine, mean)
## quality alcohol
## 1 3 10.34500
## 2 4 10.15245
## 3 5 9.80884
## 4 6 10.57537
## 5 7 11.36794
## 6 8 11.63600
## 7 9 12.18000
So we see the average sulphates and alcohol amount for each quality.
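The two `aggregate()` calls above can also be combined into one by putting `cbind()` on the left-hand side of the formula. A small sketch on a made-up data frame (`toy` and its values are illustrative only, not taken from the wine data):

```r
# Toy stand-in for the wine data frame (values are made up for illustration)
toy <- data.frame(
  quality   = c(5, 5, 6, 6, 7, 7),
  sulphates = c(0.45, 0.51, 0.44, 0.54, 0.50, 0.52),
  alcohol   = c(9.5, 10.1, 10.2, 11.0, 11.2, 11.6)
)

# cbind() on the left-hand side aggregates several columns in one call
means_by_quality <- aggregate(cbind(sulphates, alcohol) ~ quality,
                              data = toy, FUN = mean)
means_by_quality
```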
Next, I want to know which feature has the most variation.
sapply(wine, sd, na.rm=TRUE)
## X fixed.acidity volatile.acidity
## 1.414075e+03 8.438682e-01 1.007945e-01
## citric.acid residual.sugar chlorides
## 1.210198e-01 5.072058e+00 2.184797e-02
## free.sulfur.dioxide total.sulfur.dioxide density
## 1.700714e+01 4.249806e+01 2.990907e-03
## pH sulphates alcohol
## 1.510006e-01 1.141258e-01 1.230621e+00
## quality
## 8.856386e-01
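Ignoring the row index X, total.sulfur.dioxide has the largest standard deviation. Standard deviations depend on each feature's units and scale, so this comparison is only a rough guide; we’ll dig into total.sulfur.dioxide further below.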
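Since the features are measured on very different scales, one way to compare their variability is the coefficient of variation (standard deviation divided by mean). This is a swapped-in measure, not the one used in the analysis above. The sketch below uses the standard deviations and means printed by `sapply()` and `summary()` earlier; the vector names `sds`, `means`, and `cv` are introduced here.

```r
# Standard deviations and means copied from the sapply() and summary() output
sds <- c(fixed.acidity = 0.8438682, volatile.acidity = 0.1007945,
         citric.acid = 0.1210198, residual.sugar = 5.072058,
         chlorides = 0.02184797, free.sulfur.dioxide = 17.00714,
         total.sulfur.dioxide = 42.49806, density = 0.002990907,
         pH = 0.1510006, sulphates = 0.1141258, alcohol = 1.230621)
means <- c(fixed.acidity = 6.855, volatile.acidity = 0.2782,
           citric.acid = 0.3342, residual.sugar = 6.391,
           chlorides = 0.04577, free.sulfur.dioxide = 35.31,
           total.sulfur.dioxide = 138.4, density = 0.9940,
           pH = 3.188, sulphates = 0.4898, alcohol = 10.51)

# Coefficient of variation: spread relative to the typical magnitude
cv <- sds / means
round(sort(cv, decreasing = TRUE), 3)
```

On this relative measure residual.sugar ranks first and total.sulfur.dioxide falls to the middle of the list, so which feature varies "most" depends on the yardstick.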
The main interest is to see what feature (or combination of features) of the dataset affects quality of wine the most.
From the description of the features, those that affect taste are: volatile acidity, citric acid, residual sugar, chlorides, and total sulfur dioxide.
The problem here is that the feature of interest, “quality”, should probably be a category. Its values are discrete ratings on a scale from 0 (bad quality) to 10 (very good quality), although only the values 3 through 9 occur in this data.
We will change the ‘quality’ feature into a category below as part of the preprocessing step.
#Process the data to make quality be a category
wine$quality <- factor(wine$quality)
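One consequence of this conversion is worth keeping in mind whenever the factor is turned back into numbers for plotting: `as.numeric()` on a factor returns the internal level codes, not the ratings. A short sketch on an illustrative vector (`ratings`, `quality_f`, and `quality_ord` are hypothetical names):

```r
# Small illustrative vector of ratings (not taken from the data)
ratings <- c(6, 3, 9, 5)
quality_f <- factor(ratings)

as.numeric(quality_f)                 # level codes: 3 1 4 2
as.numeric(as.character(quality_f))   # the original ratings: 6 3 9 5

# An ordered factor keeps the ranking information that a plain factor drops
quality_ord <- factor(ratings, levels = 0:10, ordered = TRUE)
quality_ord[1] > quality_ord[2]       # comparisons are defined for ordered factors
```

Whether to use a plain or an ordered factor is a design choice; the ordered form is an option if the ranking of the ratings should be preserved.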
This is the result of ggpairs on the wine data.
In the last row, we can see that for high quality wines (quality >= 8), the sensory features are low in quantity. The sensory features are: volatile acidity, citric acid, residual sugar, chlorides, and total sulfur dioxide.
Let’s look into this a bit more, starting with total sulfur dioxide.
The histograms below, faceted by quality, suggest that higher quality wines tend to have less total sulfur dioxide.
library(ggplot2)
# Histograms faceted by quality; scale_fill_brewer() was dropped because no fill aesthetic is mapped
ggplot(aes(x=total.sulfur.dioxide), data=wine) + geom_histogram() + facet_wrap( ~quality)
ggplot(aes(x=alcohol), data=wine) + geom_histogram(binwidth=0.1) + facet_wrap( ~quality)
# The sulphates/alcohol ratio stays well below 0.2, so it needs a much narrower bin than 0.1
ggplot(aes(x=sulphates/alcohol), data=wine) + geom_histogram(binwidth=0.005) + facet_wrap( ~quality)
ggplot(aes(x=citric.acid), data=wine) + geom_histogram(binwidth=0.1) + facet_wrap( ~quality)
From the ggpairs plot, we can see that some variables have interesting relationships with each other. For example: (1) citric acid increases as fixed acidity increases; (2) pH decreases as fixed acidity increases; (3) alcohol decreases as density increases; (4) density increases as chlorides increase.
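The direction of relationships like these can also be read from a correlation matrix. The sketch below uses a made-up data frame whose values were chosen to mimic two of the directions described above; `toy` and `cor_matrix` are illustrative names, and the numbers are not from the wine data.

```r
# Toy values chosen to mimic the directions described above (not real data)
toy <- data.frame(
  fixed.acidity = c(6.0, 6.5, 7.0, 7.5, 8.0),
  citric.acid   = c(0.25, 0.30, 0.28, 0.35, 0.40),
  pH            = c(3.40, 3.30, 3.25, 3.20, 3.10)
)

# Pairwise Pearson correlations; the sign gives the direction of each relationship
cor_matrix <- cor(toy)
round(cor_matrix, 2)

# On the full data frame, one would restrict to numeric columns first, e.g.
# cor(wine[sapply(wine, is.numeric)])
```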
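Let’s take another look at total.sulfur.dioxide, this time as a mean per quality level. Note that the column named total_sulphates in the output holds the mean total sulfur dioxide, not the sulphates feature.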
quality_group <- group_by(wine, quality)
summarize(quality_group, total_sulphates = mean(total.sulfur.dioxide),n = n())
## Source: local data frame [7 x 3]
##
## quality total_sulphates n
## 1 3 170.6000 20
## 2 4 125.2791 163
## 3 5 150.9046 1457
## 4 6 137.0473 2198
## 5 7 125.1148 880
## 6 8 126.1657 175
## 7 9 116.0000 5
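The grouped summary can be sanity-checked against the overall figures reported earlier: the group sizes should add up to 4898, and the size-weighted average of the group means should reproduce the overall mean of total.sulfur.dioxide (138.4 in the `summary()` output). A sketch using the numbers printed above; `by_quality`, `mean_so2`, `overall_mean`, and `extreme_share` are names introduced here.

```r
# Group means and sizes copied from the summarize() output above
by_quality <- data.frame(
  quality  = 3:9,
  mean_so2 = c(170.6000, 125.2791, 150.9046, 137.0473,
               125.1148, 126.1657, 116.0000),
  n        = c(20, 163, 1457, 2198, 880, 175, 5)
)

sum(by_quality$n)   # total number of wines

# Size-weighted average of the group means
overall_mean <- weighted.mean(by_quality$mean_so2, by_quality$n)
overall_mean

# Share of wines in the two extreme groups (quality 3 and 9)
extreme_share <- sum(by_quality$n[by_quality$quality %in% c(3, 9)]) / sum(by_quality$n)
extreme_share
```

The extreme groups hold well under 1% of the wines, so their means (170.6 and 116.0) rest on very few observations and should be read with caution.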
The strongest relationship I found with quality is total.sulfur.dioxide.
The relationships of total sulfur dioxide and citric acid with quality look similar: consistent with the observation above, both tend to be lower for higher quality wines.
This is seen from the ggpairs plot above.
Now, maybe the ratio of total sulfur dioxide to citric acid has an interesting relationship to quality. So we’ll try this.
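# Ratio of total sulfur dioxide to citric acid (a row-wise computation, so no grouping is needed)
new_df <- wine %>% mutate(total_sulfur_dioxide_over_citric_acid = total.sulfur.dioxide/citric.acid)
# quality is a factor at this point: convert via as.character() so the x axis shows the ratings 3 to 9 rather than the level codes 1 to 7
# fun = mean replaces the deprecated fun.y (ggplot2 3.3.0 and later)
ggplot(aes(x=as.numeric(as.character(quality)), y=total_sulfur_dioxide_over_citric_acid), data=new_df) + geom_point(color=I('orange'), alpha = 0.5, position = position_jitter(h=0)) + scale_x_continuous() + scale_y_continuous(limits=c(0, quantile(new_df$total_sulfur_dioxide_over_citric_acid, 0.99))) + geom_line(stat = 'summary', fun = mean)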
We can see that the ratio decreases as the quality increases.
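One detail of this ratio deserves care: the `summary()` output shows a minimum citric.acid of 0, and dividing by zero yields `Inf` in R. In the plot those points fall outside the y limits and are dropped, but any direct mean of the ratio would be affected. A minimal sketch on made-up rows (`toy`, `ratio`, and `ratio_clean` are illustrative names):

```r
# Toy rows, one of them with zero citric acid (values are illustrative)
toy <- data.frame(
  total.sulfur.dioxide = c(170, 132, 97, 186),
  citric.acid          = c(0.36, 0.34, 0.00, 0.32)
)

ratio <- toy$total.sulfur.dioxide / toy$citric.acid
is.infinite(ratio)            # the zero-citric-acid row gives Inf

# Replace undefined ratios with NA so they can be skipped explicitly
ratio_clean <- ifelse(toy$citric.acid > 0, ratio, NA)
mean(ratio_clean, na.rm = TRUE)
```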
It appears that these sensory features are related to one another: volatile acidity, citric acid, residual sugar, chlorides, and total sulfur dioxide.
We’ll check fixed acidity vs citric acid as they seem to be related.
# scatter
ggplot(aes(x=fixed.acidity, y=citric.acid), data=wine) + geom_point(fill=I('#F79420'), color=I('black'), alpha = 0.3, shape=21, position = position_jitter(w = 0.1, h = 0.1)) + scale_x_continuous(limits=c(3.8, quantile(wine$fixed.acidity, 0.99))) + scale_y_continuous(limits=c(0.0001, quantile(wine$citric.acid, 0.99))) + stat_smooth(method='lm')
Here we can see that citric acid increases as fixed acidity increases.
We’ll check the fixed.acidity vs pH as they seem to be related.
# scatter
ggplot(aes(x=fixed.acidity, y=pH), data=wine) + geom_point(fill=I('#F79420'), color=I('black'), alpha = 0.3, shape=21, position = position_jitter(w = 0.1, h = 0.1)) + scale_x_continuous(limits=c(3.8, quantile(wine$fixed.acidity, 0.99))) + scale_y_continuous(limits=c(2.720, quantile(wine$pH, 0.99))) + stat_smooth(method='lm')
## Warning: Removed 88 rows containing missing values (stat_smooth).
## Warning: Removed 123 rows containing missing values (geom_point).
Here we see pH decrease as fixed acidity increases.
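The "Removed ... rows" warnings above arise because limits set through `scale_x_continuous()` and `scale_y_continuous()` drop out-of-range rows before the smoother is fit. If the intent is only to zoom the view, `coord_cartesian()` is an alternative that keeps every row in the fit. A minimal sketch, assuming ggplot2 is installed and using a made-up data frame (`toy` and its values are illustrative):

```r
library(ggplot2)  # assumed to be installed

# Toy stand-in for the wine data (made-up values, including one extreme point)
toy <- data.frame(
  fixed.acidity = c(6.0, 6.5, 7.0, 7.5, 8.0, 14.2),
  pH            = c(3.40, 3.30, 3.25, 3.20, 3.10, 2.80)
)

# coord_cartesian() only zooms the view; the lm fit still uses all six rows
p <- ggplot(toy, aes(x = fixed.acidity, y = pH)) +
  geom_point(alpha = 0.3) +
  stat_smooth(method = "lm", formula = y ~ x) +
  coord_cartesian(xlim = c(5.5, 8.5), ylim = c(2.9, 3.6))
p
```

Which behaviour is preferable depends on the goal: scale limits exclude extreme points from the fitted line, while `coord_cartesian()` leaves the fit untouched.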
# scatter
ggplot(aes(x=density, y=alcohol), data=wine) + geom_point(fill=I('#F79420'), color=I('black'), shape=21) + scale_x_continuous(limits=c(0.9871, 1.0390)) + scale_y_continuous(limits=c(7.9, 14.20)) + stat_smooth(method='lm')
## Warning: Removed 57 rows containing missing values (geom_path).
Here alcohol decreases as density increases.
# scatter
ggplot(aes(x=chlorides, y=density), data=wine) + geom_point(fill=I('#F79420'), color=I('black'), shape=21) + scale_x_continuous(limits=c(0, quantile(wine$chlorides, 0.99))) + scale_y_continuous(limits=c(0.9861, quantile(wine$density, 0.99))) + stat_smooth(method='lm')
As chlorides increase so does the density.
The wine data has 4898 observations with features that describe how a wine may smell or taste. We assume these features affect the quality (which is itself subjective). Some preprocessing was needed to work with the data. The feature ‘quality’ was of interest, but it was stored as an integer instead of a factor. This feature is categorical: a wine is rated on a scale from 0 to 10, and the ratings in this data range from 3 to 9. It appears that good quality wines have lower amounts of the sensory features than lower quality wines:
volatile acidity, citric acid, residual sugar, chlorides, total sulfur dioxide
The description of these variables from the data site suggests that these features affect the smell and/or taste. There is also some evidence that the ratio of some of these features is related to quality: higher quality wines appear to have a lower ratio of total sulfur dioxide to citric acid.
As a follow up to this project, I would look at other ratios of other features.
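One way such a follow-up could be organised is to enumerate every pair of features with `combn()` and compute the mean ratio per quality level. This is only a sketch under stated assumptions: the data frame `toy`, its values, and the names `features`, `feature_pairs`, and `ratio_means` are made up for illustration.

```r
# Toy stand-in for the wine data (values are made up for illustration)
toy <- data.frame(
  quality        = factor(c(5, 5, 6, 6, 7, 7)),
  citric.acid    = c(0.30, 0.34, 0.32, 0.36, 0.40, 0.38),
  chlorides      = c(0.050, 0.048, 0.045, 0.044, 0.036, 0.038),
  residual.sugar = c(8.5, 6.9, 7.0, 5.2, 1.6, 1.5)
)
features <- c("citric.acid", "chlorides", "residual.sugar")

# Every unordered pair of features: one column per pair
feature_pairs <- combn(features, 2)

# Mean of each ratio within each quality level (rows: quality, columns: ratio)
ratio_means <- sapply(seq_len(ncol(feature_pairs)), function(i) {
  ratio <- toy[[feature_pairs[1, i]]] / toy[[feature_pairs[2, i]]]
  tapply(ratio, toy$quality, mean)
})
colnames(ratio_means) <- paste(feature_pairs[1, ], feature_pairs[2, ], sep = "/")
ratio_means
```

On the real data, features that can be zero (such as citric.acid) would need to be handled before being used as a denominator.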